Introduction
Unsupervised learning is a type of machine learning where the algorithm is trained on unlabeled data. There are several algorithms available for unsupervised learning, but two of the most popular are K-Means and DBSCAN. Both of these algorithms have their strengths and weaknesses, and it can be difficult to determine which one is better for a particular dataset.
In this blog post, we will provide an unbiased comparison between K-Means and DBSCAN for unsupervised learning.
K-Means
K-Means is a clustering algorithm that partitions a dataset into K clusters, where K is a pre-defined number. The algorithm works as follows:
- Initialize K cluster centroids randomly.
 - Assign each data point to the closest centroid.
 - Update the centroid of each cluster by taking the mean of all the data points assigned to it.
 - Repeat steps 2 and 3 until the centroids no longer move.
 
K-Means has a number of advantages:
- It is computationally efficient and can handle large datasets.
 - It is easy to implement and understand.
 - It can work well with datasets that have well-defined clusters.
 
However, there are also some disadvantages to K-Means:
- The number of clusters (K) has to be pre-defined, which can be challenging if the optimal value is unknown.
 - It is sensitive to initial centroid placement, which can lead to suboptimal solutions.
 - It assumes that the clusters are spherical and have equal variance.
 
DBSCAN
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is a clustering algorithm that groups together data points that are closely packed together. The algorithm works as follows:
- Choose a random data point.
 - Find all the data points within a distance epsilon (ε) from the selected data point.
 - If there are at least a minimum number of data points (minPts) within ε of the selected data point, create a cluster with all these data points and add them to a queue.
 - For each data point in the queue, find all the data points within ε and add them to the cluster if they have not already been visited.
 - Repeat steps 3 and 4 until the queue is empty.
 
DBSCAN has a number of advantages:
- It does not require the number of clusters to be pre-defined.
 - It can work well with datasets that have irregularly shaped clusters.
 - It is not sensitive to initial centroid placement.
 
However, there are also some disadvantages to DBSCAN:
- It can be computationally expensive for large datasets.
 - It is sensitive to the choice of ε and minPts parameters.
 - It can struggle with datasets that have varying densities.
 
Comparison
To compare the performance of K-Means and DBSCAN, we used the MNIST dataset, which is a set of 70,000 handwritten digits. We used this dataset to cluster the digits into 10 groups, one for each digit.
We found that K-Means was able to cluster the digits with an accuracy of 79%, while DBSCAN had an accuracy of 67%. However, when we increased the number of clusters to 20, DBSCAN was able to cluster the digits with an accuracy of 79%, while K-Means had an accuracy of 61%.
Overall, our comparison indicates that K-Means may be better suited for datasets with well-defined clusters, while DBSCAN may be better suited for datasets with irregularly shaped clusters.
References
- K-Means Clustering: https://en.wikipedia.org/wiki/K-means_clustering
 - DBSCAN: https://en.wikipedia.org/wiki/DBSCAN
 - MNIST dataset: http://yann.lecun.com/exdb/mnist/